This dataset contains 1,599 red wines with 11 input features describing the chemical properties of each wine. The output variable, quality, is based on at least 3 evaluations made by wine experts and is rated on a scale from 0 (very bad) to 10 (very excellent).
My main goal in this analysis is to understand how chemical features affect the quality of wine and to predict the subjective quality of a wine from its objective properties. However, I will also look at other interesting relationships as I dig deeper into the dataset.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## Classes 'tbl_df', 'tbl' and 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The median quality for red wines is 6.0 and the mean quality is 5.636, slightly lower than the median. The minimum quality is 3.0 and the maximum is 8.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## FALSE
## 1599
## [1] 280
## [1] 0.1751094
## [1] 0.8248906
Quality is mostly between 5 and 7 and relatively symmetric, which is consistent with the median and mean being close. All quality ratings are integers. 82.5% of wines are rated either 5 or 6 (only 280 of the 1,599 wines fall outside that range), which means it probably won’t be easy to predict wine quality because the majority of the data have almost identical ratings.
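The count and the two proportions printed above can be reproduced with a couple of lines. This is a minimal sketch on a toy vector, not the original code; the helper name `share_in` and the toy ratings are assumptions made for illustration.

```r
# Hypothetical helper: fraction of values in x that fall in the set `levels`.
share_in <- function(x, levels) {
  mean(x %in% levels)
}

# Toy ratings for illustration only (not the wine data).
quality <- c(5L, 6L, 5L, 7L, 3L, 6L, 5L, 8L, 6L, 5L)

n_outside <- sum(!(quality %in% c(5, 6)))   # wines rated neither 5 nor 6
p_inside  <- share_in(quality, c(5, 6))     # share of wines rated 5 or 6
```

Applied to the real data frame, `share_in(wine$quality, c(5, 6))` would correspond to the 0.825 shown in the output.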
## 7.2
## 67
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The fixed acidity levels are roughly centered around 7.5 g/dm^3, but the right tail is a little longer than the left. The mode of fixed acidity is 7.2 g/dm^3 (67 wines), the median is 7.9 g/dm^3 and the mean is 8.32 g/dm^3.
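Base R has no built-in function for the statistical mode, so a value and count like the ones printed above have to come from a frequency table. A minimal sketch, assuming a plain numeric vector; `stat_mode` is a hypothetical helper, not necessarily how the original output was produced.

```r
# Hypothetical helper: most frequent value of a numeric vector and its count.
stat_mode <- function(x) {
  counts <- table(x)
  top <- which.max(counts)            # first maximum if there are ties
  c(value = as.numeric(names(counts)[top]),
    count = as.integer(counts[top]))
}

# Toy values for illustration only (not the wine data).
acidity <- c(7.2, 7.4, 7.2, 7.8, 7.2, 7.4)
m <- stat_mode(acidity)
```

Here `m[["value"]]` is the mode and `m[["count"]]` is how many times it occurs.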
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most volatile acidity levels are between 0.3 g/dm^3 and 0.7 g/dm^3. Median is 0.52 g/dm^3 and mean is 0.5278 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## FALSE TRUE
## 1467 132
## 0
## 132
The distribution of citric acid levels seems a little random with a few peaks. It’s worth noting the mode of citric acid is actually zero. Since citric acid can add freshness and flavor to wines, I wonder if these wines have low quality.
It turns out the quality distribution is not that different from that of the whole sample, which suggests that other variables outweigh citric acid in the wines that contain none.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Most wines have residual sugar between 1.5 and 3 g/dm^3. The median is 2.2 g/dm^3 and the mean is 2.539 g/dm^3. Some wines have higher sugar levels, but even the highest residual sugar amount, 15.5 g/dm^3, is still far below the threshold for what’s considered sweet (45 g/dm^3).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most wines have chlorides between 0.05 g/dm^3 and 0.1 g/dm^3. Median is 0.079 g/dm^3 and mean is 0.08747 g/dm^3. The chlorides of this sample go all the way up to 0.611 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Before transformation, the distribution of free sulfur dioxide looks long tailed. After transforming the data by taking log10 to better understand the distribution, I did not gain much new insight. The distribution peaks around 6 mg/dm^3. The median is 14.00 mg/dm^3 and mean is 15.87 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
##
## FALSE TRUE
## 1597 2
## [1] 278 289
The distribution of total sulfur dioxide is again long-tailed, peaking at about 15 mg/dm^3. I did not observe any interesting pattern after transforming the x variable with log10. There are two outliers, one with a total sulfur dioxide level of 278 mg/dm^3 and the other at 289 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] 10
## [1] 71
## [1] 1518
The density distribution is centered around 0.9968 g/cm^3 and seems relatively symmetric. It’s likely to be highly correlated with alcohol, sugar and other features. The median is 0.9968 g/cm^3 and the mean is 0.9967 g/cm^3. For all wines, density remains very close to 1 g/cm^3 (the density of water), with a minimum of 0.9901 g/cm^3. 1518 out of 1599 wines have a density less than 1 g/cm^3, 10 wines have a density of exactly 1, and 71 have a density larger than 1. Since alcohol is less dense than water, wines that are heavier than water must have a significant amount of sugar and other chemicals (relative to alcohol) to bring the density up. I will compare the sugar-to-alcohol ratios in wines with different densities.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07087 0.18180 0.20950 0.23620 0.24750 1.32400
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2111 0.2615 0.3009 0.4450 0.3784 1.7110
Here are the density plots of sugar-to-alcohol ratios. The peak for heavier wines is to the right of the peak for lighter wines, which is to be expected. The median for heavier wines is about 0.09 larger than that for lighter wines (0.3009 vs. 0.2095), and the mean is about 0.21 larger (0.4450 vs. 0.2362).
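The comparison of sugar-to-alcohol ratios can be sketched as follows. This is an illustration on a toy data frame whose column names follow the dataset; the derived columns `sugar.alcohol` and `heavier` are assumed names, and putting wines with a density of exactly 1 in the lighter group is also an assumption, since the original grouping rule is not shown.

```r
# Toy data for illustration only; column names follow the dataset.
wine_toy <- data.frame(
  residual.sugar = c(1.9, 2.6, 6.1, 2.0, 8.8),
  alcohol        = c(9.4, 9.8, 10.5, 12.0, 9.2),
  density        = c(0.9978, 0.9968, 1.0004, 0.9940, 1.0021)
)

# Ratio of residual sugar (g/dm^3) to alcohol (% by volume).
wine_toy$sugar.alcohol <- wine_toy$residual.sugar / wine_toy$alcohol

# Label wines relative to the density of water (1 g/cm^3).
wine_toy$heavier <- wine_toy$density > 1

# Median ratio within each group; names are "FALSE" (lighter) and "TRUE" (heavier).
med_by_group <- tapply(wine_toy$sugar.alcohol, wine_toy$heavier, median)
```

Replacing `median` with `summary` in the `tapply` call gives a six-number summary per group, similar in shape to the two blocks of output above.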
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH values of most wines fall between 3 and 3.7. The Maximum pH is 4.01, so all wines are acidic. Median is 3.31, mean is 3.311. pH value is very likely to be highly correlated with the acidity features.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
##
## FALSE TRUE
## 1591 8
Most wines have sulphates between 0.5 and 0.8 g/dm^3. Only 8 wines have more than 1.5 g/dm^3 of sulphates. The median is 0.62 g/dm^3 and the mean is 0.6581 g/dm^3. The sulphate level also contributes to sulfur dioxide levels.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## 9.5
## 139
##
## FALSE TRUE
## 1598 1
The distribution of alcohol levels peaks at 9.5% (the mode, with 139 wines). The right tail is significantly longer than the left, extending all the way to 14.9%, which is also the only value larger than 14%. The median is 10.2% and the mean is 10.42%.
There are 1599 wines with 13 variables. The first one, “X”, is simply the index, leaving only 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). All features are numeric, and quality takes only integer values.
Other observations:
Quality is the main feature. I’d like to find out which features can be used to predict the quality.
Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide and alcohol are all likely to have an effect on the quality of wines.
A sugar-to-alcohol ratio was added to better understand the difference between wines heavier than water and lighter wines.
There are a few things I’ve noticed:
Originally I thought residual sugar is also an important feature in determining quality, but now it seems that’s not the case.
First I’ll explore pairs of features with relatively high correlation coefficients. There are 2 pairs that surprised me the most.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and pH
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1880823 0.2807254
## sample estimates:
## cor
## 0.2349373
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and pH
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.1731696 0.2948641
## sample estimates:
## cor
## 0.2349373
I was expecting a negative coefficient, but it was actually 0.23. Even the 99% confidence interval doesn’t cross 0. However, after making the scatter plot, the relationship looks somewhat random.
After reading Wikipedia, I found out that acetic acid (the volatile acid) at 1.0 molar concentration has a pH of 2.4, while citric acid at the same concentration has a pH of 1.57. The molar mass of citric acid is also more than 3 times that of acetic acid. So if we hold the mass concentration of both acids the same, the pH of acetic acid will be a lot higher than that of citric acid. In an extreme case, if we were to add acetic acid to pure citric acid, I’d expect the pH of the mixture might increase. In reality, acetic acid is present in wine, not in pure citric acid, and the correlation coefficient between pH and acetic acid is not high, but it now makes some sense to me how it can be positive. Of course, correlation does not imply causation. Maybe the real cause is other features, included or not included in the dataset, that change pH and happen to coincide with the change in acetic acid content.
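The two test outputs above differ only in the confidence level, which `cor.test` takes through its `conf.level` argument. A minimal sketch on synthetic data (the vectors `x` and `y` are made up for illustration and are not the wine data):

```r
# Synthetic data for illustration only (not the wine data).
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

# Same test, two confidence levels; only the interval width changes.
ct95 <- cor.test(x, y, conf.level = 0.95)
ct99 <- cor.test(x, y, conf.level = 0.99)

ct95$estimate   # Pearson's r, identical in both tests
ct95$conf.int   # 95% interval
ct99$conf.int   # 99% interval, wider than the 95% one
```

The estimate, t statistic and p-value are the same in both calls, which matches the two printed outputs sharing `t = 9.659` and `cor = 0.2349373`.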
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
At first, I was really surprised by the relatively large negative correlation coefficient between citric acid and acetic acid, since I thought they were mostly independent of each other. But after searching online, I learned that during fermentation citric acid tends to be converted into acetic acid, which can potentially explain the negative correlation coefficient: more volatile acid may simply mean that more citric acid has been converted into acetic acid.
Out of all the features, alcohol is the one with the highest correlation coefficient with quality. Next I will look at the scatter plot of quality vs. alcohol.
The vertical strips appear because all quality ratings are integers. Overall, quality increases with more alcohol. The red line is the median at each quality rating. The blue line is a linear fit. Next, I’ll look at other features that contribute significantly to quality.
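The two summary lines in the plot can be computed without any plotting library. A rough sketch, assuming quality is on the x axis and alcohol on the y axis, so the median line is the median alcohol content per rating and the linear fit regresses alcohol on quality; the toy data frame is made up for illustration.

```r
# Toy data for illustration only (not the wine data).
wine_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.0, 10.5, 11.0, 11.2, 12.0, 12.5)
)

# Median alcohol content at each quality rating (the median line).
med_alcohol <- tapply(wine_toy$alcohol, wine_toy$quality, median)

# Straight-line fit of alcohol on quality (the linear-fit line).
fit <- lm(alcohol ~ quality, data = wine_toy)
slope <- coef(fit)[["quality"]]
```

A positive `slope` and medians that rise with the rating are the numerical counterpart of the upward trend described in the text.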
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
The red line here is a linear fit. The quality slightly increases as fixed acidity increases.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Too much volatile acid leads to an unpleasant, vinegar taste, so wines with higher volatile acidity tend to receive a lower quality rating.
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The overall trend is that quality slightly increases with more citric acid in the wine, as citric acid adds flavor to the wine.
##
## Pearson's product-moment correlation
##
## data: wine$chlorides and wine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
Although the trend looks weak on the plot, the statistical analysis indicates a negative coefficient. It would be interesting to see how much the chlorides feature can improve the prediction model when I perform the multivariate analysis.
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
Total sulfur dioxide seems to affect quality in a negative way.
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$quality
## t = 10.3798, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Sulphates seem to enhance the quality of wines, which makes sense because they act as antimicrobial and antioxidant agents that protect the quality of wines.
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
I expected sugar to contribute to quality as well, but the correlation is essentially zero (the 95% confidence interval includes 0). Maybe I have to look at this relationship again in the multivariate analysis.
None of the relationships look very linear. Some of them are relatively easy to spot in their scatter plots. Others require statistical analysis to identify. Part of the reason the relationships are generally not easy to see is that the quality ratings are all integers, and 82.5% of wines have a quality rating of either 5 or 6.
Quality slightly increases with higher fixed acidity. It’s hard for me to see that on the scatter plot, but the statistical analysis yields a positive correlation coefficient with more than 99% confidence.
Quality drops with more volatile acid. This is not as clear in the low volatile acidity range, but it becomes more obvious in the higher range, since too much volatile acid creates an unpleasant taste.
Citric acid is usually used to improve the flavor of wines. This relationship is again not so clear in the lower citric acid range, but becomes somewhat clearer in the higher range.
Quality seems to decrease with more chlorides in the wine. This is not very clear on the scatter plot either.
Quality slightly drops with more total sulfur dioxide as well. This is relatively obvious on the scatter plot for wines with a quality of 5 or higher.
I was a little surprised that the correlation coefficient for sulphates and quality is slightly higher than that of citric acid and quality, since citric acid enhances flavor while sulphates are only there as an antimicrobial and antioxidant. I guess the quality may drop significantly without proper protection from the addition of sulphates. But if we focus on the range with sulphates > 1.0, it seems too much sulphate actually reduces the quality of wines. It’s just that most of the wines fall into the range where this is not the case.
Alcohol seems to be the most important feature that affects quality here. Generally, the quality of wines increases as the alcohol level increases.
Yes.
I was very surprised to see that pH slightly increases as volatile acidity (acetic acid) increases. At first, I thought this must be a somewhat random result for our particular dataset. But after doing some research, I realized that acetic acid has a higher pH than citric acid at the same molar concentration, and the molar mass of citric acid is much higher than that of acetic acid, so the pH of acetic acid is much higher than that of citric acid at the same mass concentration. When added to a relatively acidic environment, acetic acid can probably increase pH. Alternatively, other features that increase pH may happen to coincide with a higher acetic acid level.
The other relationship that surprised me was that citric acid decreases as volatile acidity (acetic acid) increases. I was expecting them to be independent of each other. After searching online, I found out that citric acid tends to be converted into acetic acid during fermentation, which might be the reason for this odd relationship.
The strongest relationship is the one between fixed acidity and citric acid, but that’s just because a large portion of fixed acid is citric acid. So maybe I should only use citric acid when predicting quality, since the two overlap so much. The feature that affects quality the most is alcohol; the correlation coefficient is 0.476.
Again, the main goal is to understand the important determining factors of quality, in other words, the subjective variable quality as a function of different objective, measurable features. Since the correlation coefficient between alcohol and quality is the highest among all input features, I will mostly use alcohol content as the x variable in the following plots while using another input variable as the color. If the correlation between the second input variable and quality is strong enough, I should be able to observe a pattern in how the color changes while holding alcohol content constant.
Colors in this plot are so close because of the few wines with extremely high residual sugar levels. I will make another plot without the high sugar levels.
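One common way to drop the extreme values before re-plotting is to cut at an upper quantile. A minimal sketch on toy values; the 90th-percentile cutoff is an assumption chosen for illustration, since the cutoff used for the plots is not stated.

```r
# Toy residual sugar values for illustration only (not the wine data).
sugar <- c(1.9, 2.6, 2.3, 1.8, 2.0, 6.1, 15.5, 2.2, 2.4, 1.6)

# Keep wines at or below the 90th percentile of residual sugar.
cutoff <- quantile(sugar, probs = 0.90)
keep   <- sugar <= cutoff
sugar_trimmed <- sugar[keep]
```

With a data frame, the same logical vector can be used as a row filter, for example `wine[wine$residual.sugar <= cutoff, ]`, so the color scale is stretched over the bulk of the data.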
It still doesn’t look like sugar plays an important role in determining quality.
Quality is overall higher for lower volatile acidity.
Quality is higher for higher citric acid level.
I need to narrow down the range of chlorides content so colors are not so similar.
The pattern is not as clear for chlorides. I will still add chlorides in my predicting model and see how much difference it makes.
Again, I need to narrow down the range.
The pattern is not clear either, but there seems to be relatively more points with more sulfur dioxide for lower quality.
Quality is higher for higher sulphates level.
##
## Calls:
## qual.m1: lm(formula = quality ~ alcohol, data = wine)
## qual.m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## qual.m3: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid,
## data = wine)
## qual.m4: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
## total.sulfur.dioxide, data = wine)
## qual.m5: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
## total.sulfur.dioxide + sulphates, data = wine)
## qual.m6: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
## total.sulfur.dioxide + sulphates + chlorides, data = wine)
##
## =================================================================================
## qual.m1 qual.m2 qual.m3 qual.m4 qual.m5 qual.m6
## ---------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.055*** 3.248*** 2.843*** 2.985***
## (0.175) (0.184) (0.194) (0.200) (0.205) (0.206)
## alcohol 0.361*** 0.314*** 0.314*** 0.302*** 0.295*** 0.276***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.017)
## volatile.acidity -1.384*** -1.343*** -1.307*** -1.222*** -1.104***
## (0.095) (0.114) (0.114) (0.112) (0.115)
## citric.acid 0.068 0.106 -0.043 0.065
## (0.103) (0.103) (0.104) (0.106)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001)
## sulphates 0.721*** 0.908***
## (0.103) (0.111)
## chlorides -1.763***
## (0.403)
## ---------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.317 0.324 0.344 0.352
## adj. R-squared 0.226 0.316 0.316 0.322 0.342 0.349
## sigma 0.710 0.668 0.668 0.665 0.655 0.651
## F 468.267 370.379 246.976 190.618 166.962 143.910
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1621.596 -1614.095 -1589.749 -1580.192
## Deviance 805.870 711.796 711.603 704.957 683.814 675.689
## AIC 3448.114 3251.628 3253.192 3240.189 3193.499 3176.384
## BIC 3464.245 3273.136 3280.078 3272.452 3231.138 3219.401
## N 1599 1599 1599 1599 1599 1599
## =================================================================================
As I expected, the linear model here doesn’t work that well. With a linear model, the features I selected explain only 35% of the variance in quality (R-squared of 0.352 for qual.m6).
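The R-squared row of the table is the share of variance in the response explained by each nested model. A minimal sketch of that comparison on synthetic data; the simulated coefficients and noise level are assumptions picked only to give a moderate fit, and the data are not the wine data.

```r
# Synthetic data for illustration only (not the wine data).
set.seed(42)
n <- 300
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- 3 + 0.3 * alcohol - 1.2 * volatile.acidity + rnorm(n, sd = 0.65)
toy <- data.frame(quality, alcohol, volatile.acidity)

# Nested models: the second adds one predictor to the first.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = toy)

# Share of the variance in quality explained by each model.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
```

R-squared can never decrease when a predictor is added to a nested least-squares model, which is why the adjusted R-squared, AIC and BIC rows are the more useful ones for judging whether an extra feature is worth keeping.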
Now I will look at density.
Density of wines is higher for higher fixed acidity level and more residual sugar, which makes a lot of sense.
The relationship between density and alcohol is even more obvious than that between density and sugar.
##
## Calls:
## den.m1: lm(formula = density ~ fixed.acidity, data = wine)
## den.m2: lm(formula = density ~ fixed.acidity + alcohol, data = wine)
## den.m3: lm(formula = density ~ fixed.acidity + alcohol + residual.sugar,
## data = wine)
##
## ================================================
## den.m1 den.m2 den.m3
## ------------------------------------------------
## (Intercept) 0.991*** 0.999*** 0.999***
## (0.000) (0.000) (0.000)
## fixed.acidity 0.001*** 0.001*** 0.001***
## (0.000) (0.000) (0.000)
## alcohol -0.001*** -0.001***
## (0.000) (0.000)
## residual.sugar 0.000***
## (0.000)
## ------------------------------------------------
## R-squared 0.446 0.654 0.746
## adj. R-squared 0.446 0.654 0.746
## sigma 0.001 0.001 0.001
## F 1287.167 1508.935 1562.809
## p 0.000 0.000 0.000
## Log-likelihood 8234.081 8610.211 8857.637
## Deviance 0.003 0.002 0.001
## AIC -16462.161 -17212.423 -17705.274
## BIC -16446.030 -17190.914 -17678.388
## N 1599 1599 1599
## ================================================
With all three features I selected and a linear model, 74.6% of the variance in density is explained.
Next I’ll analyze the positive correlation coefficient between pH and volatile acidity.
Since wines tend to contain less volatile acid if they have more citric acid, it’s possible that citric acid is simply the more dominant factor: pH is lower with more citric acid, and more citric acid usually means less volatile acid, which results in the positive correlation coefficient. If this is the main reason, then holding citric acid constant, I would expect pH to still be lower with more volatile acid. The next plot will tell if that’s really the case.
First I’d like to note that I used exp(-pH) because pH is the negative log of the activity of hydrogen ions, which is really “the true acidity”.
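One caveat about this transform: pH is defined with a base-10 logarithm, so the hydrogen ion activity itself is 10^(-pH), while exp(-pH) is a power of that activity. Both decrease as pH rises, so the sign of a fitted coefficient is unaffected, but the two are not proportional. A short sketch of the relationship (the variable names are chosen for illustration):

```r
# Minimum, median and maximum pH from the summary of the dataset.
ph <- c(2.74, 3.31, 4.01)

activity_10  <- 10^(-ph)   # hydrogen ion activity implied by pH
activity_exp <- exp(-ph)   # transform used in the models in this report

# exp(-pH) equals (10^(-pH))^(1 / log(10)); log() is the natural log in R.
ratio <- activity_exp / activity_10   # not constant, so not proportional
```

Refitting with `10^(-pH)` as the response would put the coefficients on the scale of the activity itself, at the cost of no longer matching the tables shown here.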
Given the same citric acid level, it’s not clear to me whether pH tends to be higher or lower with more volatile acid. This probably means the above guess is not the main reason for the positive correlation coefficient. Thus my previous analysis may still be true: in a generally acidic environment, adding a small amount of volatile acid can in fact increase pH. Because volatile acid content tends to decrease with higher citric acid content, I want to perform a linear fit with the two features accounted for separately.
##
## Calls:
## pH.m1: lm(formula = exp(-pH) ~ citric.acid, data = wine)
## pH.m2: lm(formula = exp(-pH) ~ citric.acid + volatile.acidity, data = wine)
##
## =======================================
## pH.m1 pH.m2
## ---------------------------------------
## (Intercept) 0.033*** 0.030***
## (0.000) (0.001)
## citric.acid 0.016*** 0.018***
## (0.001) (0.001)
## volatile.acidity 0.003***
## (0.001)
## ---------------------------------------
## R-squared 0.297 0.306
## adj. R-squared 0.297 0.305
## sigma 0.005 0.005
## F 675.870 351.099
## p 0.000 0.000
## Log-likelihood 6283.552 6292.912
## Deviance 0.036 0.036
## AIC -12561.103 -12577.824
## BIC -12544.972 -12556.316
## N 1599 1599
## =======================================
Here the linear model actually shows a positive coefficient for volatile acidity. My y variable is exp(-pH), so this means volatile acidity has the same direction of effect on pH as citric acid, in the sense that they both contribute to acidity. But the size of the contribution from volatile acid is much lower than that of citric acid, only about 1/6 in terms of the magnitude of the coefficients (0.003 vs. 0.018). Since the coefficient is so low, the effect of volatile acidity is not nearly as important as that of citric acid.
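A raw correlation and a fitted coefficient can have opposite signs when two predictors are themselves correlated. The following is a sketch on synthetic data, not the wine data: the three coefficients are borrowed loosely from the pH.m2 table, while the link between the two acids and the noise levels are assumptions made for illustration.

```r
# Synthetic data for illustration only (not the wine data).
set.seed(7)
n <- 2000
citric   <- runif(n, 0, 1)
volatile <- 0.8 - 0.6 * citric + rnorm(n, sd = 0.1)   # less volatile acid when citric is high
acidity  <- 0.030 + 0.018 * citric + 0.003 * volatile + rnorm(n, sd = 0.002)

# Marginal view: volatile acid alone appears to go with LOWER acidity.
marginal_r <- cor(volatile, acidity)

# Adjusted view: with citric acid in the model, the volatile coefficient is positive.
fit <- lm(acidity ~ citric + volatile)
partial_b <- coef(fit)[["volatile"]]
```

In this simulation `marginal_r` is negative even though volatile acid truly adds to acidity, because its effect is masked by the stronger, negatively correlated citric acid term.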
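The most important feature that contributes to quality is alcohol. The other relatively important features are volatile acidity, sulphates and citric acid, with total sulfur dioxide and chlorides playing lesser roles.
However, a linear model does not work very well in predicting quality with the features at hand.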
Density is closely related to fixed acidity, alcohol and residual sugar. Wines are heavier with more fixed acid and sugar, less alcohol.
At first glance, the correlation coefficient between pH and volatile acid is positive, which seemed a little counterintuitive to me. After separating out the effect of citric acid, volatile acid seems to also reduce pH, but to a much lesser extent than citric acid.
Yes, I used linear models to predict quality and to study density and pH. However, the linear model for quality was not very good, since quality does not depend on the features very linearly. It’s a very subjective feature obtained from a small number of evaluations made by wine experts, so it’s likely to have a lot of randomness.
Density was described relatively well by a linear model using alcohol, fixed acidity and sugar, because it’s completely objective, and these are likely the most important features that affect density. It’s also worth noting that density is basically an average over all the ingredients in a wine, so a linear model should capture the key variations rather well.
I also used a linear model to look at exp(-pH) and volatile acidity, which shows that when holding citric acid constant, volatile acidity also reduces pH (increases exp(-pH)).
My main goal is to better understand which input features affect quality and how much they affect the rating. 82.5% of the wines have a quality of 5 or 6, meaning most wines are considered to be of average quality. The distribution is similar to a normal distribution. The minimum quality rating is 3 and the maximum quality rating is 8.
Among all features in the dataset that affect the quality rating, the correlation coefficient between alcohol and quality is the highest. The 95% confidence interval of the coefficient is [0.4374, 0.5132]. Thus it makes sense to look at the scatter plot of quality vs. alcohol (quality is a function of the rest of the features). In the plot, I’ve put quality on the x axis so the lines span all quality ratings. Although the correlation is not very strong, it’s still clear that wines tend to have better quality with higher alcohol content. The red line is the median alcohol content at each quality rating, and the blue line is a linear fit.
Among all features, alcohol and volatile acidity are the two most significant features in determining the quality of wines. Generally speaking, a combination of high alcohol content and low volatile acidity makes a better wine. The correlation coefficients between these two features and quality are 0.4762 and -0.3906 respectively. In the plot, the wines with medium to dark blue colors (quality ratings of 7 and 8) are mostly in the top left part of the plot, which has high alcohol content and low volatile acidity. The wines with orange and red colors (quality ratings of 3 and 4) are mostly scattered within the bottom right part of the plot, which has low alcohol content and high volatile acidity. The rest of the wines, with a rating of 5 or 6 and comprising 82.5% of the sample, are located somewhere in between. Although the correlation coefficients were not very high, the clear pattern demonstrated by the plot still motivated me to try out a linear model on this dataset.
The red wine dataset contains 12 features on 1599 different wines. 11 out of the 12 are chemical properties of the wines and 1 of them is the quality rating evaluated by at least 3 wine experts. My main goal was to understand the dataset and be able to predict quality from the chemical properties.
After performing exploratory data analysis on this wine quality dataset, I’ve identified the most important features that determine wine quality: alcohol, volatile acidity, sulphates and citric acid. Total sulfur dioxide and chlorides content also play less important roles. However, quality is a very subjective feature, so my attempt at predicting it with a linear model was not very successful, but this analysis still revealed the general pattern. I was particularly frustrated by the fact that none of the correlations stand out as much as those in the diamond dataset did. The fact that 82.5% of wines have a quality of 5 or 6 means that I’m almost trying to predict a boolean variable: if the properties add up, quality is 6; if not, quality is 5. This really limited the performance of my linear model. I also looked at how density and pH vary with their relevant features and gained a better understanding of how these objective quantities change. During the analysis, I struggled to understand the correlation between citric acid and volatile acid. Then I found out about the tendency for citric acid to convert to volatile acid. The linear model for density worked relatively well because everything is physically measurable, so it’s much more predictable by nature.
To further study how to predict wine quality, I would try to obtain a larger dataset with more evaluations of every single wine, so the quality feature is less random. I would also consider changing the way quality is defined. Currently it’s the median of all evaluations, with all evaluations being integers, so many wines have exactly the same quality rating. The fine differences between wines due to differences in their other features are rounded off, making quality very hard to predict. Thus I think taking the mean after getting rid of outliers might be a better way to define quality for the purpose of prediction.
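The difference between a median rating and a mean with outliers removed can be illustrated with a trimmed mean. The expert scores below are hypothetical, and the 20% trim is just one possible way of "getting rid of outliers", not a recommendation taken from the dataset documentation.

```r
# Hypothetical expert scores for two wines (illustration only).
wine_a <- c(5, 6, 6, 6, 9)
wine_b <- c(6, 6, 6, 7, 7)

# The median gives both wines the same integer rating.
med_a <- median(wine_a)
med_b <- median(wine_b)

# A 20% trimmed mean drops the lowest and highest score of the five,
# then averages the rest, so finer differences between wines survive.
tm_a <- mean(wine_a, trim = 0.2)
tm_b <- mean(wine_b, trim = 0.2)
```

Both wines have a median of 6, but the trimmed means differ, which is the kind of extra resolution a continuous quality score would give a regression model.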